VLDeformer: Vision–Language Decomposed Transformer for fast cross-modal retrieval
Authors
Abstract
Cross-modal retrieval has emerged as one of the most important upgrades for text-only search engines (SE). Recently, with the powerful representation of pairwise text–image inputs via early interaction, the accuracy of vision–language (VL) transformers has outperformed existing methods in retrieval. However, when the same paradigm is used for inference, the efficiency of VL transformers is still too low to be applied in a real cross-modal SE. Inspired by the mechanism of human learning and using knowledge, this paper presents a novel Vision–Language Decomposed Transformer (VLDeformer), which greatly increases the efficiency of VL transformers while maintaining their outstanding accuracy. With the proposed method, cross-modal retrieval is separated into two stages: the VL transformer learning stage and the VL decomposition stage. The latter stage plays the role of single-modal indexing, to some extent like the term indexing of text search engines. The model learns cross-modal knowledge from early-interaction pre-training and is then decomposed into an individual encoder. VLDeformer requires only small target datasets for supervision and achieves both 1000+ times acceleration and less than 0.6% average recall drop. It also outperforms state-of-the-art visual-semantic embedding methods on COCO and Flickr30k.
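The speedup described in the abstract comes from replacing per-query pairwise transformer passes with offline single-modal indexing plus a dot product at query time. A minimal sketch of that retrieval pattern, with hypothetical placeholder encoders standing in for the decomposed encoders (the real VLDeformer encoders are split out of the early-interaction VL transformer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder encoders: in VLDeformer these would be the decomposed
# single-modal image and text encoders, not random projections.
def encode_images(images):
    return rng.normal(size=(len(images), 128))

def encode_text(query):
    return rng.normal(size=(128,))

def normalize(x):
    # L2-normalize so the dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Offline stage: encode and index the image gallery once.
gallery = [f"img_{i}" for i in range(1000)]
index = normalize(encode_images(gallery))        # shape (1000, 128)

# Online stage: one text-encoder pass and one matrix-vector product,
# instead of 1000 pairwise early-interaction forward passes -- this
# replacement is the source of the reported 1000+ times acceleration.
q = normalize(encode_text("a dog playing frisbee"))
scores = index @ q                               # shape (1000,)
topk = np.argsort(-scores)[:5]                   # indices of best matches
```

In contrast, an early-interaction VL transformer must re-run the full model for every (query, image) pair, so its cost grows linearly with gallery size at query time.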
Related works
Cross-Modal Manifold Learning for Cross-modal Retrieval
This paper presents a new scalable algorithm for cross-modal similarity preserving retrieval in a learnt manifold space. Unlike existing approaches that compromise between preserving global and local geometries, the proposed technique respects both simultaneously during manifold alignment. The global topologies are maintained by recovering underlying mapping functions in the joint manifold spac...
MHTN: Modal-adversarial Hybrid Transfer Network for Cross-modal Retrieval
Cross-modal retrieval has drawn wide interest for retrieval across different modalities of data (such as text, image, video, audio and 3D model). However, existing methods based on deep neural network (DNN) often face the challenge of insufficient cross-modal training data, which limits the training effectiveness and easily leads to overfitting. Transfer learning is usually adopted for relievin...
A Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the rapid growth of multimodal data. It takes one type of data as the query to retrieve relevant data of another type. For example, a user can use a text to retrieve relevant pictures or videos. Since the query and its retrieved results can be of different modalities, how to measure the content similarity between different m...
Cross-Modal Retrieval: A Pairwise Classification Approach
Content is increasingly available in multiple modalities (such as images, text, and video), each of which provides a different representation of some entity. The cross-modal retrieval problem is: given the representation of an entity in one modality, find its best representation in all other modalities. We propose a novel approach to this problem based on pairwise classification. The approach s...
Heterogeneous Metric Learning for Cross-Modal Multimedia Retrieval
Due to the massive explosion of multimedia content on the web, users demand a new type of information retrieval, called cross-modal multimedia retrieval where users submit queries of one media type and get results of various other media types. Performing effective retrieval of heterogeneous multimedia content brings new challenges. One essential aspect of these challenges is to learn a heteroge...
Journal
Journal title: Knowledge-Based Systems
سال: 2022
ISSN: 1872-7409, 0950-7051
DOI: https://doi.org/10.1016/j.knosys.2022.109316